New York City is the most populous city in the United States. It is a diverse city that attracts many businesses each year, and restaurants are among the most popular of them. There are enough restaurants in NYC that you could eat out for 23 years without visiting the same one twice. The large number of restaurants does not mean that all of them are successful, however. To thrive in such a competitive environment, you need to do an intensive study before opening one. Let's assume that we want to add an Italian restaurant to the pile of restaurants in NYC, and we would like to know the best place to open it.
Before we get the data and start exploring it, let's download all the dependencies that we will need.
import numpy as np # library to handle data in a vectorized manner
import pandas as pd # library for data analysis
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import json # library to handle JSON files
# !conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim # convert an address into latitude and longitude values
import requests # library to handle requests
from pandas import json_normalize # transform JSON into a pandas dataframe
# Matplotlib and associated plotting modules
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
# import k-means from clustering stage
from sklearn.cluster import KMeans
# !conda install -c conda-forge folium=0.5.0 --yes
import folium # map rendering library
print('Libraries imported.')
New York has a total of 5 boroughs and 306 neighborhoods. We need a dataset that includes the essential information about each of these neighborhoods. Part of the data used in this project is extracted from the Foursquare API as we go forward, so the only information we need for each neighborhood is its location.
Luckily, this dataset is freely available on the web. Here is the link to it: https://geo.nyu.edu/catalog/nyu_2451_34572. Since New York is one of the largest cities in the US, a lot of additional data about it can be found on the internet.
For your convenience, I downloaded the file and placed it in the repository. Let's load it.
with open('newyork_data.json') as json_data:
newyork_data = json.load(json_data)
Let's take a quick look at the data.
type(newyork_data)
newyork_data.keys()
Notice how all the relevant data is in the features key, which is basically a list of the neighborhoods. So, let's define a new variable that includes this data.
neighborhoods_data = newyork_data['features']
Let's take a look at the first item in this list.
neighborhoods_data[0]
As we can see, for each neighborhood, this dataset provides its name, borough, and location. We need to extract this information and convert them to a form that can be used in Python.
# define the dataframe columns
column_names = ['Borough', 'Neighborhood', 'Latitude', 'Longitude']
# collect one row per neighborhood (DataFrame.append is deprecated, so build a list of dicts first)
rows = []
for data in neighborhoods_data:
    borough = data['properties']['borough']
    neighborhood_name = data['properties']['name']
    neighborhood_latlon = data['geometry']['coordinates']
    neighborhood_lat = neighborhood_latlon[1]
    neighborhood_lon = neighborhood_latlon[0]
    rows.append({'Borough': borough,
                 'Neighborhood': neighborhood_name,
                 'Latitude': neighborhood_lat,
                 'Longitude': neighborhood_lon})
# instantiate the dataframe
NYC_neighborhoods = pd.DataFrame(rows, columns=column_names)
Let's look at the dataframe to confirm that it's correct.
NYC_neighborhoods.head()
And make sure that the dataset has all 5 boroughs and 306 neighborhoods.
print('The dataframe has {} boroughs and {} neighborhoods.'.format(
len(NYC_neighborhoods['Borough'].unique()),
NYC_neighborhoods.shape[0]
)
)
We will use this information to show a map of New York City.
address = 'New York City, NY'
geolocator = Nominatim(user_agent="ny_explorer")
location = geolocator.geocode(address)
latitude = location.latitude
longitude = location.longitude
print('The geographical coordinates of New York City are {}, {}.'.format(latitude, longitude))
To get a better understanding of the data, let's show an interactive map of the city with neighborhoods.
# create map of New York using latitude and longitude values
map_newyork = folium.Map(location=[latitude, longitude], zoom_start=10)
# add markers to map
for lat, lng, borough, neighborhood in zip(NYC_neighborhoods['Latitude'], NYC_neighborhoods['Longitude'], NYC_neighborhoods['Borough'], NYC_neighborhoods['Neighborhood']):
    label = folium.Popup('{}, {}'.format(neighborhood, borough), parse_html=True)
    folium.CircleMarker(
        [lat, lng],
        radius=4,
        weight=1,
        popup=label,
        color='white',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7).add_to(map_newyork)
map_newyork
Next, we are going to start utilizing the Foursquare API to explore the neighborhoods and segment them.
import os
from dotenv import load_dotenv
load_dotenv()
CLIENT_ID = os.getenv('CLIENT_ID')
CLIENT_SECRET = os.getenv('CLIENT_SECRET')
VERSION = '20200222' # Foursquare API version date
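The credentials are read from a local `.env` file rather than hard-coded into the notebook. Assuming the variable names above, the file would look something like this (the values shown are placeholders, not real credentials):

```shell
# .env — keep this file out of version control (add it to .gitignore)
CLIENT_ID=your_foursquare_client_id
CLIENT_SECRET=your_foursquare_client_secret
```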
Let's show the workflow for a neighborhood.
print('The neighborhood is {} in {}.'.format(NYC_neighborhoods.loc[0, 'Neighborhood'], NYC_neighborhoods.loc[0, 'Borough']))
neighborhood_latitude = NYC_neighborhoods.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude = NYC_neighborhoods.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name = NYC_neighborhoods.loc[0, 'Neighborhood'] # neighborhood name
print('Its latitude and longitude values are {}, {}.'.format(neighborhood_latitude,
                                                             neighborhood_longitude))
radius = 500
LIMIT = 200
url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
neighborhood_latitude,
neighborhood_longitude,
radius,
LIMIT)
results = requests.get(url).json()
# function that extracts the category of the venue
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']
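As a quick illustration of what this helper does, here it is applied to two made-up rows (restated as a standalone snippet so it runs on its own; the dicts mimic the two key layouts Foursquare responses can have):

```python
# Same helper as above, restated so this snippet is self-contained
def get_category_type(row):
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Made-up rows, not real API output
print(get_category_type({'categories': [{'name': 'Pizza Place'}]}))  # Pizza Place
print(get_category_type({'venue.categories': []}))                   # None
```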
venues = results['response']['groups'][0]['items']
nearby_venues = json_normalize(venues) # flatten JSON
# filter columns
filtered_columns = ['venue.name', 'venue.categories', 'venue.id', 'venue.location.lat', 'venue.location.lng']
nearby_venues = nearby_venues.loc[:, filtered_columns]
# filter the category for each row
nearby_venues['venue.categories'] = nearby_venues.apply(get_category_type, axis=1)
# clean columns
nearby_venues.columns = [col.split(".")[-1] for col in nearby_venues.columns]
print('{} venues were returned by Foursquare.'.format(nearby_venues.shape[0]))
nearby_venues
def getNearbyVenues(names, latitudes, longitudes, radius=500):
    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)
        # make the GET request
        results = requests.get(url).json()['response']['groups'][0]['items']
        # keep only the relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['id'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])
    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue id',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']
    return nearby_venues
NY_venues = getNearbyVenues(names=NYC_neighborhoods['Neighborhood'],
latitudes=NYC_neighborhoods['Latitude'],
longitudes=NYC_neighborhoods['Longitude']
)
print('done!')
print('In total', NY_venues.shape[0], 'venues were found in New York')
NY_venues.head()
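These ~300 API calls take a while, and Foursquare rate-limits requests, so it can be worth caching the result locally so that re-running the notebook doesn't hit the API again. A minimal sketch, where the helper name and file name are arbitrary choices:

```python
import os
import pandas as pd

def load_or_fetch(path, fetch):
    """Read a cached CSV if it exists; otherwise call fetch() and cache the result."""
    if os.path.exists(path):
        return pd.read_csv(path)
    df = fetch()
    df.to_csv(path, index=False)
    return df

# hypothetical usage:
# NY_venues = load_or_fetch('ny_venues.csv',
#                           lambda: getNearbyVenues(names=NYC_neighborhoods['Neighborhood'],
#                                                   latitudes=NYC_neighborhoods['Latitude'],
#                                                   longitudes=NYC_neighborhoods['Longitude']))
```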
Let's check how many venues were returned for each neighborhood.
Venue_count= NY_venues.groupby('Neighborhood').count()[['Venue']].sort_values('Venue', ascending=False)
Venue_count.reset_index(inplace=True)
Venue_count.head(10)
As we can see, Murray Hill has the most venues. Let's show the density of venues in each neighborhood.
NY_geo = r'NTA.geojson' # geojson file
# create a map centered on New York
NY_map = folium.Map(location=[latitude, longitude], zoom_start=10)
# generate a choropleth map of the number of venues found in each neighborhood
folium.Choropleth(
    geo_data=NY_geo,
    data=Venue_count,
    columns=['Neighborhood', 'Venue'],
    key_on='feature.properties.ntaname',
    fill_color='YlOrRd',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Number of Venues'
).add_to(NY_map)
# display map
NY_map
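One caveat: a choropleth silently drops neighborhoods whose names don't match the `ntaname` field in the GeoJSON. It is worth checking the overlap before reading too much into the map. A sketch with toy name lists (the real check would compare `Venue_count['Neighborhood']` against the names in the GeoJSON properties):

```python
# Toy example: names in our dataframe vs. names in the geojson
df_names = {'Astoria', 'Murray Hill', 'Soho'}
geo_names = {'Astoria', 'Murray Hill', 'SoHo'}

# Neighborhoods present in the data but absent from the geojson
unmatched = df_names - geo_names
print(unmatched)  # {'Soho'} — this one would be missing from the choropleth
```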
print('There are {} unique categories.'.format(len(NY_venues['Venue Category'].unique())))
Let's prepare the data for clustering. To do so, we will one-hot encode the categories into columns, where 1 means that the category is present in that neighborhood.
# one hot encoding
NY_onehot = pd.get_dummies(NY_venues[['Venue Category']], prefix="", prefix_sep="")
# add neighborhood column back to dataframe
NY_onehot['Neighborhood'] = NY_venues['Neighborhood']
# move neighborhood column to the first column
cols = list(NY_onehot)
cols.insert(0, cols.pop(cols.index('Neighborhood')))
NY_onehot = NY_onehot.loc[:, cols]
NY_onehot.head()
And let's examine the new dataframe size.
NY_onehot.shape
NY_grouped = NY_onehot.groupby('Neighborhood').mean().reset_index()
NY_grouped.head()
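Taking the mean of the one-hot rows per neighborhood turns counts into frequencies: each value is the fraction of that neighborhood's venues that belong to that category. A toy example with made-up neighborhoods 'A' and 'B':

```python
import pandas as pd

# Made-up venues: 'A' has two Pizza Places and a Bar; 'B' has one Bar
toy = pd.DataFrame({'Neighborhood': ['A', 'A', 'A', 'B'],
                    'Venue Category': ['Pizza Place', 'Pizza Place', 'Bar', 'Bar']})
onehot = pd.get_dummies(toy[['Venue Category']], prefix='', prefix_sep='')
onehot['Neighborhood'] = toy['Neighborhood']
grouped = onehot.groupby('Neighborhood').mean().reset_index()
print(grouped)
# 'A': Bar 1/3, Pizza Place 2/3; 'B': Bar 1.0, Pizza Place 0.0
```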
sorted_venue = pd.DataFrame(NY_grouped.mean(numeric_only=True).sort_values(ascending=False)[0:15], columns=['frequency'])
ax = sorted_venue.plot(kind='barh', rot=15, figsize=(6, 6))
ax.invert_yaxis()
# highlight the bar for the 4th most common category (Italian Restaurant)
ax.barh(3, sorted_venue.iloc[3], height=0.5, color='red')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.legend([])
plt.xlabel('Frequency')
plt.title('Most Popular Venues in New York')
plt.show()
As we can see, Italian Restaurant is the $4^{th}$ most common venue category in New York City.
NY_grouped.shape
First, let's write a function to sort the venues in descending order.
def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)
    return row_categories_sorted.index.values[0:num_top_venues]
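For instance, on a made-up grouped row whose first entry is the neighborhood name, the same logic picks the highest-frequency categories:

```python
import pandas as pd

# A made-up grouped row: neighborhood name first, then category frequencies
row = pd.Series({'Neighborhood': 'Astoria',
                 'Pizza Place': 0.10, 'Italian Restaurant': 0.08, 'Bar': 0.05})
# Same logic as return_most_common_venues: skip the name, sort by frequency
top2 = row.iloc[1:].sort_values(ascending=False).index.values[:2]
print(top2)  # ['Pizza Place' 'Italian Restaurant']
```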
Now let's create the new dataframe and display the top 10 venues for each neighborhood.
num_top_venues = 10
indicators = ['st', 'nd', 'rd']
# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except IndexError:
        columns.append('{}th Most Common Venue'.format(ind+1))
# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = NY_grouped['Neighborhood']
for ind in np.arange(NY_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(NY_grouped.iloc[ind, :], num_top_venues)
neighborhoods_venues_sorted.head()
Run k-means to cluster the neighborhoods into 7 clusters.
# set number of clusters
kclusters = 7
NY_grouped_clustering = NY_grouped.drop(columns='Neighborhood')
# run k-means clustering
kmeans = KMeans(n_clusters=kclusters, random_state=0).fit(NY_grouped_clustering)
# check cluster labels generated for each row in the dataframe
kmeans.labels_[:]
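The choice of 7 clusters is a judgment call. One common way to sanity-check it is the elbow method: run k-means for a range of k, plot the inertia (within-cluster sum of squares), and look for the bend in the curve. A sketch of the idea, shown with synthetic data so it runs standalone (on the real data you would pass `NY_grouped_clustering` instead of `X`):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((50, 8))  # stand-in for NY_grouped_clustering

inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, random_state=0, n_init=10).fit(X)
    inertias.append(km.inertia_)
# plt.plot(ks, inertias) would show the curve; pick k near the elbow
```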
Let's create a new dataframe that includes the cluster as well as the top 10 venues for each neighborhood.
# add clustering labels
try:
    neighborhoods_venues_sorted.insert(0, 'Cluster Labels', kmeans.labels_)
except ValueError:
    # the column already exists (e.g. when re-running this cell), so overwrite it
    neighborhoods_venues_sorted['Cluster Labels'] = kmeans.labels_
NY_merged = NYC_neighborhoods.copy()
# merge with neighborhoods_venues_sorted to add latitude/longitude for each neighborhood
NY_merged = NY_merged.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Neighborhood')
NY_merged = NY_merged.dropna()
NY_merged['Cluster Labels'] = NY_merged['Cluster Labels'].astype('int')
NY_merged.head() # check the last columns!
Finally, let's visualize the resulting clusters
# create map
map_clusters = folium.Map(location=[latitude, longitude], zoom_start=10)
# set color scheme for the clusters
x = np.arange(kclusters)
ys = [i + x + (i*x)**2 for i in range(kclusters)]
colors_array = cm.rainbow(np.linspace(0, 1, len(ys)))
rainbow = [colors.rgb2hex(i) for i in colors_array]
# add markers to the map
for lat, lon, poi, cluster in zip(NY_merged['Latitude'], NY_merged['Longitude'], NY_merged['Neighborhood'], NY_merged['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster],
        weight=1,
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(map_clusters)
map_clusters
NY_merged.head()
Now, you can examine each cluster and determine the discriminating venue categories that distinguish each cluster.
CL1=pd.DataFrame(NY_merged[NY_merged['Cluster Labels']==0].groupby(['1st Most Common Venue']).count().sort_values('Neighborhood',ascending=False).iloc[:5,0])
CL1.rename(columns={'Borough':'Count'})
This cluster includes the most popular venue in New York: Pizza Place.
CL2=pd.DataFrame(NY_merged[NY_merged['Cluster Labels']==1].groupby(['1st Most Common Venue']).count().sort_values('Neighborhood',ascending=False).iloc[:5,0])
CL2.rename(columns={'Borough':'Count'})
This is the cluster that we are interested in. The most popular venue in this cluster is the Italian Restaurant. Let's show this cluster on the map.
NY_map
# add markers for the neighborhoods in cluster 1 on top of the venue-density map
cluster1 = NY_merged[NY_merged['Cluster Labels'] == 1]
for lat, lon, poi, cluster in zip(cluster1['Latitude'], cluster1['Longitude'], cluster1['Neighborhood'], cluster1['Cluster Labels']):
    label = folium.Popup(str(poi) + ' Cluster ' + str(cluster), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=3,
        popup=label,
        color=rainbow[cluster],
        weight=1,
        fill=True,
        fill_color=rainbow[cluster],
        fill_opacity=0.7).add_to(NY_map)
NY_map
The map overlays the neighborhoods of this cluster on the venue density of New York. The best place to open a new Italian restaurant is in those neighborhoods of this cluster that also have a high venue density, since more people visit those areas.
CL3=pd.DataFrame(NY_merged[NY_merged['Cluster Labels']==2].groupby(['1st Most Common Venue']).count().sort_values('Neighborhood',ascending=False).iloc[:5,0])
CL3.rename(columns={'Borough':'Count'})
CL4=pd.DataFrame(NY_merged[NY_merged['Cluster Labels']==3].groupby(['1st Most Common Venue']).count().sort_values('Neighborhood',ascending=False).iloc[:5,0])
CL4.rename(columns={'Borough':'Count'})
CL5=pd.DataFrame(NY_merged[NY_merged['Cluster Labels']==4].groupby(['1st Most Common Venue']).count().sort_values('Neighborhood',ascending=False).iloc[:5,0])
CL5.rename(columns={'Borough':'Count'})
CL6=pd.DataFrame(NY_merged[NY_merged['Cluster Labels']==5].groupby(['1st Most Common Venue']).count().sort_values('Neighborhood',ascending=False).iloc[:5,0])
CL6.rename(columns={'Borough':'Count'})
CL7=pd.DataFrame(NY_merged[NY_merged['Cluster Labels']==6].groupby(['1st Most Common Venue']).count().sort_values('Neighborhood',ascending=False).iloc[:5,0])
CL7.rename(columns={'Borough':'Count'})